{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 06: NGrams - Basic Natural Language Processing\n",
    "\n",
    "In *Natural Language Processing*, one goal is to determine the sentiment or meaning of text. One technique that helps do this is to locate the most frequently-occurring, N-word phrases, or *NGrams*. Longer NGrams can convey more meaning, but they occur less frequently so few of them stand out as important. Shorter NGrams have better statistics, but each one conveys less meaning. In most cases, `N = 3-5` appears to provide the best balance.\n",
    "\n",
    "This exercise finds all NGrams matching a user-specified pattern. The default is the 4-word phrases the form `% love % %`, where the `%` are wild cards and the spaces ` ` will be treated as matches for any whitespace (but not crossing line boundaries). In other words, all 4-grams are found with `love` as the second word. The `%` are conveniences; the user can also specify an NGram Phrase that is a regular expression or a mixture, e.g., `% (hat|lov)ed? % %` finds all the phrases with `love`, `loved`, `hate`, or `hated` as the second word. \n",
    "\n",
    "> **Note:** All text is converted to lower case (but see the comment where you could change that). The ngrams matching string is not converted, so use lower case words!\n",
    "\n",
    "See the corresponding Spark job [NGrams6.scala](https://github.com/deanwampler/spark-scala-tutorial/blob/master/src/main/scala/sparktutorial/NGrams6.scala)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "in = ../data/shakespeare/all-shakespeare.txt\n",
       "count = 100\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "100"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "val in = \"../data/shakespeare/all-shakespeare.txt\"  // The plays of Shakespeare!\n",
    "val count = 100               // Upper limit on how many ngrams to return"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "ngramsStr = % love % %\n",
       "ngramsRE = \\w+\\s+love\\s+\\w+\\s+\\w+\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "\\w+\\s+love\\s+\\w+\\s+\\w+"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "val ngramsStr = \"% love % %\"    // Edit to define what ngrams you want to see.\n",
    "\n",
    "// Convert the string into a valid regex:\n",
    "val ngramsRE = ngramsStr.replaceAll(\"%\", \"\"\"\\\\w+\"\"\").replaceAll(\"\\\\s+\", \"\"\"\\\\s+\"\"\").r"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Recall this helper function from 03_WordCount. \n",
    "\n",
    "> **Note:** If you don't want the text converted to lower case, remove the call to `toLowerCase`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "toText: (str: String)String\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "def toText(str: String): String = {\n",
    "  val ary = str.toLowerCase.split(\"\\\\s*\\\\|\\\\s*\")\n",
    "  if (ary.length > 0) ary.last else \"\"\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define an object for sorting. We'll want to sort by count, _descending_, but when two counts are equal, we'll _secondary sort_ by the ngram itself. The latter step is mostly useful for repeatable unit tests! "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "defined object CountOrdering\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "object CountOrdering extends Ordering[(String,Int)] {\n",
    "  def compare(a:(String,Int), b:(String,Int)) = {\n",
    "    val cntdiff = b._2 compare a._2    // b compare a for these counts means DESCENDING order\n",
    "    if (cntdiff != 0) cntdiff else (a._1 compare b._1)\n",
    "  }\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now extract the NGrams:\n",
    "\n",
    "1. Read the text file(s).\n",
    "2. Parse each line into words.\n",
    "3. For each line, find all matches for the ngrams regex.\n",
    "4. Do _Word Count_, but with each NGram as a key, not each word.\n",
    "5. Use `takeOrdered`, an efficient function that takes the top `count` records, according to the ordering defined by `CountOrdering`. Because it does descending count ordering, we'll return the `count` most frequently occuring ngrams!\n",
    "\n",
    "Note that 3. implies that we don't match across line boundaries. (See the exercises.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "ngrams = Array((i love thee not,7), (the love i bear,4), (you love me not,4), (i love and honour,3), (i love him not,3), (i love him well,3), (i love thee better,3), (i love thee in,3), (i love you not,3), (in love with him,3), (my love to thee,3), (for love of her,2), (for love of you,2), (he love her not,2), (i love my country,2), (i love not to,2), (i love the king,2), (i love thee more,2), (i love thee well,2), (i love thy daughter,2), (i love to hear,2), (i love you more,2), (i love you the,2), (if love be blind,2), (if love make me,2), (in love or that,2), (in love with a,2), (in love with beatrice,2), (in love with her,2), (in love with me,2), (in love with my,2), (in love with thee,2), (my love and duty,2), (my love is as,2), (my love swears that,2), (my lo...\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "[(i love thee not,7), (the love i bear,4), (you love me not,4), (i love and honour,3), (i love him not,3), (i love him well,3), (i love thee better,3), (i love thee in,3), (i love you not,3), (in love with him,3), (my love to thee,3), (for love of her,2), (for love of you,2), (he love her not,2), (i love my country,2), (i love not to,2), (i love the king,2), (i love thee more,2), (i love thee well,2), (i love thy daughter,2), (i love to hear,2), (i love you more,2), (i love you the,2), (if love be blind,2), (if love make me,2), (in love or that,2), (in love with a,2), (in love with beatrice,2), (in love with her,2), (in love with me,2), (in love with my,2), (in love with thee,2), (my love and duty,2), (my love is as,2), (my love swears that,2), (my love to her,2), (my love to you,2), (of love i mean,2), (of love in her,2), (the love i bore,2), (the love of god,2), (the love of laughter,2), (thy love to me,2), (you love my son,2), (you love the gentleman,2), (you love the maid,2), (your love to her,2), (a love even such,1), (a love so vile,1), (a love that makes,1), (a love that your,1), (a love to see,1), (all love the womb,1), (all love to see,1), (ambitious love hath so,1), (and love and great,1), (and love and quiet,1), (and love are still,1), (and love as mine,1), (and love bids me,1), (and love doth mince,1), (and love for love,1), (and love holds quantity,1), (and love my cousin,1), (and love my friend,1), (and love of us,1), (and love sir thurio,1), (and love that still,1), (and love thee after,1), (and love thee no,1), (and love thee too,1), (and love them not,1), (and love this man,1), (and love thy misery,1), (and love to richard,1), (and love were young,1), (and love with me,1), (and love you all,1), (and love you well,1), (angels love good men,1), (any love you owe,1), (are love not a,1), (as love between them,1), (as love doth give,1), (as love hath pursued,1), (as love in twain,1), (as love is blind,1), (as love is full,1), (as love not sorrow,1), (because love is blind,1), (becomed love i might,1), (best love and credence,1), (best love draw to,1), (betrothed love and now,1), (better love a dream,1), (bought love with such,1), (boy love is perjured,1), (buried love doth live,1), (but love from love,1), (but love from us,1)]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "val ngrams = sc.textFile(in)\n",
    "  .flatMap { line =>\n",
    "    val text = toText(line)\n",
    "    ngramsRE.findAllMatchIn(text).map(_.toString)   // Find matches only per line. Return matching strings\n",
    "  }\n",
    "  .map(ngram => (ngram, 1))                         // Word Count, but using the ngrams\n",
    "  .reduceByKey(_ + _)\n",
    "  .takeOrdered(count)(CountOrdering)                // Take the top \"count\" most frequently occurring ngrams"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that we could have used `.sortByKey(descending = false).take(n)`, but it's much less efficient to sort the whole data set, then throw away most of it!\n",
    "\n",
    "Note that `takeOrdered` returns a Scala `Array[(String,Int)]`, not an `RDD`.\n",
    "\n",
    "Let's format the output a little better:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "outputLines = Array(Found 100 ngrams:, \"               i love thee not\t7\", \"               the love i bear\t4\", \"               you love me not\t4\", \"             i love and honour\t3\", \"                i love him not\t3\", \"               i love him well\t3\", \"            i love thee better\t3\", \"                i love thee in\t3\", \"                i love you not\t3\", \"              in love with him\t3\", \"               my love to thee\t3\", \"               for love of her\t2\", \"               for love of you\t2\", \"               he love her not\t2\", \"             i love my country\t2\", \"                 i love not to\t2\", \"               i love the king\t2\", \"              i love thee more\t2\", \"              i love thee well\t2\", \"           i love thy daughter\t2\", \"                i love...\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "[Found 100 ngrams:,                i love thee not\t7,                the love i bear\t4,                you love me not\t4,              i love and honour\t3,                 i love him not\t3,                i love him well\t3,             i love thee better\t3,                 i love thee in\t3,                 i love you not\t3,               in love with him\t3,                my love to thee\t3,                for love of her\t2,                for love of you\t2,                he love her not\t2,              i love my country\t2,                  i love not to\t2,                i love the king\t2,               i love thee more\t2,               i love thee well\t2,            i love thy daughter\t2,                 i love to hear\t2,                i love you more\t2,                 i love you the\t2,               if love be blind\t2,                if love make me\t2,                in love or that\t2,                 in love with a\t2,          in love with beatrice\t2,               in love with her\t2,                in love with me\t2,                in love with my\t2,              in love with thee\t2,               my love and duty\t2,                  my love is as\t2,            my love swears that\t2,                 my love to her\t2,                 my love to you\t2,                 of love i mean\t2,                 of love in her\t2,                the love i bore\t2,                the love of god\t2,           the love of laughter\t2,                 thy love to me\t2,                you love my son\t2,         you love the gentleman\t2,              you love the maid\t2,               your love to her\t2,               a love even such\t1,                 a love so vile\t1,              a love that makes\t1,               a love that your\t1,                  a love to see\t1,              all love the womb\t1,                all love to see\t1,         ambitious love hath so\t1,             and love and great\t1,             and love and quiet\t1,             and love are still\t1,               and love as mine\t1,               and love bids me\t1,            and love doth mince\t1,              and love for love\t1,        and love holds quantity\t1,             and love my cousin\t1,             and love my friend\t1,                 and love of us\t1,            and love sir thurio\t1,            and love that still\t1,            and love thee after\t1,               and love thee no\t1,              and love thee too\t1,              and love them not\t1,              and love this man\t1,            and love thy misery\t1,            and love to richard\t1,            and love were young\t1,               and love with me\t1,               and love you all\t1,              and love you well\t1,           angels love good men\t1,               any love you owe\t1,                 are love not a\t1,           as love between them\t1,              as love doth give\t1,           as love hath pursued\t1,               as love in twain\t1,               as love is blind\t1,                as love is full\t1,             as love not sorrow\t1,          because love is blind\t1,           becomed love i might\t1,         best love and credence\t1,              best love draw to\t1,         betrothed love and now\t1,            better love a dream\t1,          bought love with such\t1,           boy love is perjured\t1,          buried love doth live\t1,             but love from love\t1,               but love from us\t1]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "val outputLines = s\"Found ${ngrams.size} ngrams:\" +:\n",
    "  ngrams.map {\n",
    "    case (ngram, count) => \"%30s\\t%d\".format(ngram, count)\n",
    "  }"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's print the first 20 a bit more legibly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found 100 ngrams:\n",
      "               i love thee not\t7\n",
      "               the love i bear\t4\n",
      "               you love me not\t4\n",
      "             i love and honour\t3\n",
      "                i love him not\t3\n",
      "               i love him well\t3\n",
      "            i love thee better\t3\n",
      "                i love thee in\t3\n",
      "                i love you not\t3\n",
      "              in love with him\t3\n",
      "               my love to thee\t3\n",
      "               for love of her\t2\n",
      "               for love of you\t2\n",
      "               he love her not\t2\n",
      "             i love my country\t2\n",
      "                 i love not to\t2\n",
      "               i love the king\t2\n",
      "              i love thee more\t2\n",
      "              i love thee well\t2\n"
     ]
    }
   ],
   "source": [
    "outputLines.take(20).foreach(println)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Clearly, many people were _not_ in love with each other in Shakespeare!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Save the output, if you want. We've seen most of what's interesting and if you save the output, then rerun, you'll have to either delete the old output or specify a different output directory:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "out = output/ngrams\n",
       "output = ParallelCollectionRDD[6] at makeRDD at <console>:42\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "ParallelCollectionRDD[6] at makeRDD at <console>:42"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "val out = \"output/ngrams\"\n",
    "val output = sc.makeRDD(outputLines)  // convert back to an RDD\n",
    "output.saveAsTextFile(out)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Recap\n",
    "\n",
    "NGrams are not only useful, they can be fun to play with..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercises"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 1: Try different input texts and NGram specifications\n",
    "\n",
    "For example, try the `% (hat|lov)ed? % %` specification. Try the Bible texts."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 2: Match NGrams across lines\n",
    "\n",
    "This is harder, as it's difficult to create an RDD of lines, then match across lines. However, try using the same code we used for the `Crawl` step in notebook 5 for the _Inverted Index_. Recall that we used a method of reading a whole file as a single record. That will work fine as long as no file is too big."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Apache Toree - Scala",
   "language": "scala",
   "name": "apache_toree_scala"
  },
  "language_info": {
   "codemirror_mode": "text/x-scala",
   "file_extension": ".scala",
   "mimetype": "text/x-scala",
   "name": "scala",
   "pygments_lexer": "scala",
   "version": "2.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
